Semantic Segmentation of CMR Images using Deep Learning

Coursework Group: 16
Group Members (in alphabetical order): Binglun Wang, Chaosui Peng, Huanbo Lyu, Huawei Li, Ruizhi Li, Shuxin Wang, Xinyan Xiang

1. Introduction

In digital image processing and computer vision, semantic segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share common characteristics. In this task, semantic segmentation of magnetic resonance (MR) images is implemented using deep neural networks.

The dataset for this task consists of 200 grayscale intensity images at 96 × 96 resolution and 120 corresponding masks: 100 images for training (with masks), 20 for validation (with masks) and 80 for testing (without masks). The objective is to segment the original images pixelwise into 4 classes: the background (label: 0), the right ventricle (RV, label: 1), the myocardium (Myo, label: 2) and the left ventricle (LV, label: 3).
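
For concreteness, such image/mask pairs can be wrapped in a PyTorch `Dataset`. The sketch below is a hypothetical minimal version: the class name `CMRDataset` and the in-memory array layout are our own assumptions, not the coursework's actual loader.

```python
import torch
from torch.utils.data import Dataset

class CMRDataset(Dataset):
    """Hypothetical wrapper for 96x96 intensity images and label masks.

    `images` is an (N, 96, 96) float array; `masks` is an (N, 96, 96)
    integer array with labels in {0, 1, 2, 3}, or None for the 80
    unlabelled test cases.
    """
    def __init__(self, images, masks=None):
        self.images = images
        self.masks = masks

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # Add a channel dimension: (96, 96) -> (1, 96, 96).
        x = torch.from_numpy(self.images[i]).float().unsqueeze(0)
        if self.masks is None:
            return x
        return x, torch.from_numpy(self.masks[i]).long()
```

Such a dataset can then be fed to a standard `torch.utils.data.DataLoader` for batching and shuffling.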

To begin with, we investigated several different network architectures to compare their performance and generalisation capacity. By comparing these models, the optimal one for this task is applied to the test set. This process also gives us a direction for optimisation, as it helps us learn more about segmentation networks and optimization approaches in various papers.

The final performance of the model is measured by an average Dice Score computed on Kaggle over the 80 test cases, which ranges from 0 to 1 (the higher, the better). We aim to achieve an average Dice Score of 0.85.

2. Implementation

2.1. Network Selection

Some efficient architectures for semantic segmentation have been selected: Fully Convolutional Networks (Long et al., 2015), SegNet (Badrinarayanan et al., 2017), DeepLab (Chen et al., 2017; Chen et al., 2018) and UNet (Ronneberger et al., 2015; Huang et al., 2020), together with their variations.

Fully Convolutional Networks (FCN), the first end-to-end fully convolutional architecture for segmentation, removes the fixed-size input constraint and recovers the spatial information lost by fully connected layers (Long et al., 2015). However, it is less sensitive to details in the image and produces coarse segmentation maps, so more work is needed to achieve better performance.

SegNet improves segmentation resolution by using an encoder-decoder architecture and raises memory efficiency by reusing pooling indices during upsampling (Badrinarayanan et al., 2017). We compared an advanced version, whose encoder and decoder each follow the first 13 layers of the VGG16 network, with a basic version using 4 convolutional layers for both encoder and decoder.

DeepLab (Chen et al., 2017) combines the advantages of a spatial pyramid pooling module and an encoder-decoder structure. A newer version (DeepLab v3+) changes the backbone, e.g. to ResNet or Xception, and applies depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder blocks (Chen et al., 2018). In UNet (Ronneberger et al., 2015) and its later version (UNet 3+), semantic information in deep, coarse layers is combined with appearance information in shallow, fine layers via plain skip connections (full-scale skip connections and full-scale deep supervision in UNet 3+ (Huang et al., 2020)); thus better performance can be achieved on medical image segmentation.

Since the architecture combining UNet and UNet 3+ achieves the best results on the test set, we focus on these two networks when explaining our implementation in detail.

2.2. Network Architecture


Fig. 1. Architectures of UNet (a) and UNet 3+ (b).

In our bagging model, two networks are utilised, and the final output is a weighted combination of the results produced by UNet and UNet 3+ (shown in Fig. 1). In this architecture, a basic version of UNet 3+ is applied, which omits the bilinear interpolation in the full-scale skip connections, the full-scale deep supervision, and the classification-guided module (CGM). Furthermore, the last layer of UNet is removed.

Since both networks use an encoder-decoder structure, the two main tasks are building the contracting path and the expansive path.

In the UNet architecture, each block in the contracting path consists of several convolutional layers (each followed by a rectified linear unit (ReLU)), ending with a downsampling step using a max pooling operation. The encoding process of UNet 3+ is identical to that of UNet (except for the number of layers).

However, the two decoders differ. In the expansive path of UNet, each block consists of two components: a concatenation with the corresponding cropped feature map from the contracting path, and a series of standard operations (upsampling the feature map, followed by several convolutional layers and ReLUs). UNet 3+, in contrast, combines multi-scale features in its skip connections, rather than the plain connections used in UNet. These full-scale skip connections incorporate information from both the encoder and the decoder itself to construct each feature map: the feature map from the same-scale encoder layer and low-level detailed information from smaller-scale encoder layers are aggregated (via max pooling), followed by batch normalization and a ReLU, to generate the final segmentation map.
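
The UNet encoder and decoder blocks described above can be sketched in PyTorch as follows. This is a minimal illustration with class names of our own choosing; we use padded 3×3 convolutions (so the skip feature maps need no cropping) and a transposed convolution for upsampling.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU (one UNet block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Down(nn.Module):
    """Encoder step: 2x max-pool downsampling followed by DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.step = nn.Sequential(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))

    def forward(self, x):
        return self.step(x)

class Up(nn.Module):
    """Decoder step: upsample, concatenate the encoder feature map, DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = DoubleConv(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                              # double the spatial size
        return self.conv(torch.cat([skip, x], dim=1))  # plain skip connection
```

Stacking four `Down` blocks and four `Up` blocks around a bottleneck yields the classic UNet shape; UNet 3+ replaces the single `skip` input with pooled/upsampled maps from every encoder scale.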

2.3. Implementation

To apply the above networks to this segmentation task, a number of preparations are necessary, i.e., the implementation and configuration of our model candidates, the optimizer including the initialization of learning rate and weight decay, the loss function, and several other hyperparameters such as batch size, dropout, and maximum number of epochs.


Fig. 2. Implementation.

After loading the training and validation data (both images and masks), we begin the training process. In each epoch, we first set the gradients to zero, and then the model parameters are updated by the optimizer using a self-defined loss function that guides every step we take (shown in Fig. 2). To investigate the performance of our model, the time cost, the Dice Score and the average value of the self-defined loss on both the training and validation sets are recorded, so that we can select the model that performs best among all candidates (the bagging of UNet and UNet 3+ in our case). Furthermore, the confusion matrix of each model can also be used to evaluate the accuracy of segmentation.
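
The epoch loop just described can be sketched as follows. This is a minimal version under our own naming; the real loop additionally records time, Dice Score and validation metrics.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, loss_fn, optimizer):
    """One epoch: zero grads, forward, loss, backward, step; return mean loss."""
    model.train()
    total, n = 0.0, 0
    for images, masks in loader:
        optimizer.zero_grad()           # reset gradients at each step
        loss = loss_fn(model(images), masks)
        loss.backward()                 # backpropagate the self-defined loss
        optimizer.step()                # update the model parameters
        total += loss.item()
        n += 1
    return total / n
```

Running this once per epoch and evaluating on the validation set afterwards gives the per-epoch loss and Dice curves used for model selection.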

To provide further evidence for the generalization performance of the model, we run it on the test set of 80 cases; computing the Dice Score there gives a more objective evaluation of our model.

2.4. Associated Performance Analysis Mechanisms

Ranging from 0 to 1, the Dice Score evaluates segmentation performance by comparing the predicted mask with the ground truth. As given in the instruction file (shown in Fig. 3), we compute the average Dice Score of the training, validation and test sets from the overlap between our predicted output mask (X) and the ground truth (Y). The average Dice Score for each example is the average of the Dice Scores of the 3 labels other than the background (0).
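
The per-label Dice Score 2|X∩Y| / (|X|+|Y|), averaged over the three foreground labels, can be sketched as below. `dice_score` is our own illustrative helper, not the official marking script; the small `eps` (an assumption) avoids division by zero.

```python
import torch

def dice_score(pred, target, num_classes=4, eps=1e-6):
    """Average Dice over the foreground labels (RV=1, Myo=2, LV=3).

    pred, target: integer label maps of the same shape, e.g. (H, W).
    Note: a class absent from both maps scores ~1 due to eps.
    """
    scores = []
    for c in range(1, num_classes):          # skip background (label 0)
        p = (pred == c).float()
        t = (target == c).float()
        inter = (p * t).sum()                # |X ∩ Y| for this label
        scores.append((2 * inter + eps) / (p.sum() + t.sum() + eps))
    return torch.stack(scores).mean()
```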


Fig. 3. Dice Score

Our self-defined loss function combines the cross-entropy loss and the Dice Score (shown in Fig. 4). By calculating the average loss and Dice Score of the training and validation sets after each epoch, a gap can be observed between the two curves (for the training and validation sets respectively). Drawing on this gap, we can evaluate the generalization performance of a model.
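
One way to combine the two terms is cross-entropy plus (1 − soft Dice), sketched below. The `soft_dice` helper and the equal weighting are our illustrative assumptions; the report's actual formula (Section 3.2.1) uses a different scaling of the Dice term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_dice(logits, target, num_classes=4, eps=1e-6):
    """Differentiable Dice over foreground classes, from raw logits."""
    probs = F.softmax(logits, dim=1)                              # (N, C, H, W)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * onehot).sum(dims)
    card = probs.sum(dims) + onehot.sum(dims)
    dice = (2 * inter + eps) / (card + eps)                       # per-class Dice
    return dice[1:].mean()                                        # skip background

class CombinedLoss(nn.Module):
    """Hypothetical combination: cross-entropy plus (1 - mean foreground Dice)."""
    def __init__(self):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, target):
        return self.ce(logits, target) + (1 - soft_dice(logits, target))
```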


Fig. 4. Self-defined loss function

To further evaluate the precision of our model, we calculate a confusion matrix over the 4 labels by comparing predicted labels with the ground truth (shown in Fig. 5).
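
A 4×4 confusion matrix over the pixel labels can be computed in a few lines; this sketch (function name our own) uses the standard bincount trick.

```python
import torch

def confusion_matrix(pred, target, num_classes=4):
    """Confusion matrix: rows = ground-truth label, columns = predicted label."""
    # Encode each (truth, prediction) pair as a single index, then count.
    idx = target.reshape(-1) * num_classes + pred.reshape(-1)
    counts = torch.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)
```

Row-normalising this matrix gives the per-class recall values shown in the figures.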


Fig. 5. Confusion matrix

By setting a timer at the beginning and the end of a process on a fixed dataset (training or inference), and counting the total number of model parameters, we can estimate the computational efficiency of a model.
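
A minimal sketch of such a profiler is shown below (the `profile` helper is our own; it times inference over a fixed set of batches and counts parameters).

```python
import time
import torch
import torch.nn as nn

def profile(model, batches):
    """Return (wall-clock inference time over the batches, parameter count)."""
    n_params = sum(p.numel() for p in model.parameters())
    model.eval()
    start = time.time()
    with torch.no_grad():        # inference only: no gradient bookkeeping
        for x in batches:
            model(x)
    return time.time() - start, n_params
```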

3. Experiment

3.1. Optimization Perspective

3.1.1. Data augmentation


Fig. 6. Random flipping horizontally

By randomly flipping our training images horizontally, we improve the network's capacity for feature extraction. An obvious improvement can be seen by comparing the confusion matrices calculated before and after data augmentation (shown in Fig. 7). The average Dice Score on the validation set rose from 0.88459 to 0.89084.
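
For segmentation, the flip must be applied jointly to the image and its mask so that the pixel labels stay aligned; a minimal sketch (helper name our own):

```python
import random
import torch

def random_hflip(image, mask, p=0.5):
    """With probability p, flip image and mask together along the width axis."""
    if random.random() < p:
        image = torch.flip(image, dims=[-1])
        mask = torch.flip(mask, dims=[-1])
    return image, mask
```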


Fig. 7. Confusion matrices: comparison before and after flipping

After downsampling through 5 layers in UNet, the size of our input (96 × 96) becomes quite small, so we tried removing the last two layers of the encoder. Meanwhile, we tried expanding the size of the input image (shown in Fig. 8) so that more information is preserved in this process, but the output was not satisfactory. We also adjusted the contrast and brightness of the input images, as follows.


Fig. 8. Size changing

As shown by the confusion matrices calculated before and after increasing the brightness of images (with data augmentation applied, Fig. 9), the performance is not significantly improved, which does not meet our expectation: the score dropped from 0.89084 (Fig. 9(a)) to 0.88920 (Fig. 9(b)).

Fig. 9. Confusion matrices ((a) and (b)): comparison for size changing

3.1.2. Optimizer

AdamW (shown in Fig. 10) was less satisfactory when adapted to our models, as finding the optimum with it was very time-consuming.


Fig. 10. AdamW

Our self-defined function updates the learning rate directly (shown in Fig. 11).


Fig. 11. Self-defined function to adapt learning rate

We update the learning rate according to the number of epochs using a self-defined schedule (shown in Fig. 12), which reduces the effect of overshooting the minima during optimisation.
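
An epoch-based schedule of this kind can be sketched as follows; the halving factor and step length below are illustrative assumptions, not the exact values used in the report.

```python
import torch

def adjust_learning_rate(optimizer, epoch, base_lr=1e-3, decay=0.5, step=20):
    """Halve the learning rate every `step` epochs (hypothetical schedule)."""
    lr = base_lr * (decay ** (epoch // step))
    for group in optimizer.param_groups:
        group['lr'] = lr          # write the new rate into every param group
    return lr
```

Calling this at the start of each epoch keeps later steps small, which lessens overshooting near a minimum.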


Fig. 12. Updating the learning rate

3.1.3. Loss function


Fig. 13. Cross-entropy loss function

Since the final generalization performance is measured by the Dice Score on the test set, we combine the Dice Score with the cross-entropy loss (Fig. 13) as a self-defined loss function to update our weights.

3.1.4. Model modification

Since each network has gone through a number of upgrades (for example, UNet and DeepLab), different architectures and approaches have been used across versions. To better train a model to segment our input images, some adjustments are made to the structure of the networks we use.

For example, as mentioned in Section 2.2, we modify the number of layers in UNet. As our input images are 96 × 96 × 1, they are compressed to 6 × 6 × 1024 after four blocks of convolutions and max pooling, which might lose some essential feature information. After removing the last layer of the encoder and modifying the decoder accordingly, we stop downsampling when the size reaches 12 × 12 × 512. With this structure, the model achieves better performance.
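
The spatial-size arithmetic behind this choice is easy to verify: each 2× max-pool halves the resolution, so four pools take 96 down to 6, while stopping one pool earlier keeps 12. A tiny helper (our own) makes the check explicit:

```python
def encoder_sizes(input_size=96, n_pools=4):
    """Spatial size after each 2x max-pool in the contracting path."""
    sizes = [input_size]
    for _ in range(n_pools):
        sizes.append(sizes[-1] // 2)
    return sizes

# encoder_sizes(96, 4) -> [96, 48, 24, 12, 6]
# encoder_sizes(96, 3) -> [96, 48, 24, 12]
```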

3.1.5. Bagging model

A new output can be produced by combining the outputs (segmentation maps) of the two networks, each with an assigned weight (shown in Fig. 14).
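
A minimal sketch of this weighted combination, assuming both models emit raw class logits (function name and the fixed weight `w` are our own):

```python
import torch
import torch.nn.functional as F

def bagged_prediction(logits_a, logits_b, w=0.5):
    """Weighted average of two models' class probabilities, then per-pixel argmax."""
    probs = w * F.softmax(logits_a, dim=1) + (1 - w) * F.softmax(logits_b, dim=1)
    return probs.argmax(dim=1)    # (N, H, W) label map
```

Averaging probabilities rather than labels lets a confident model outvote an uncertain one pixel by pixel.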


Fig. 14. Bagging model

3.2. Optimization Process of Networks

From the perspective of optimization, a fixed and popular process is applied: modifying the loss function and updating the optimizer. Moreover, data pre-processing and model restructuring are also useful avenues of optimisation.

For example, in the bagging of UNet and UNet 3+, we aim to combine the two networks' abilities to extract both semantic and appearance information. Making use of their strengths, we came up with this method to produce a new map.

Two more detailed optimization processes are described below, taking DeepLab and SegNet as examples:

3.2.1. DeepLab

In DeepLab, ResNet and Xception are adopted as the backbone to optimize our segmentation separately.

Firstly, the effects of different output strides on the model are investigated by adopting the same loss function and optimizer in the ResNet model. We choose output strides of 8 and 16, finding that the former performs better and its loss decreases faster during training. Then we change other hyperparameters and use our self-defined loss function as below:

Loss Function = nn.CrossEntropyLoss() + (3 - total_categorical_dice() / 7)

Using Adam with an adaptive (self-defined) learning rate as the optimizer, the maximum average Dice Score on the validation set reaches 0.74 before overfitting.

The model using Xception is optimized similarly, with the self-defined loss function below:

Loss Function = nn.CrossEntropyLoss() + (3 - total_categorical_dice() / 9)

Before overfitting, the maximum average Dice Score on the validation set reaches 0.81 using Adam with a learning rate of 0.001.

3.2.2. SegNet

Several optimization approaches are used to achieve better performance and minimize the extent of overfitting for SegNet, and the results are summarised in the table below.

It can be observed that when adopting the dropout layer and our self-defined loss with the Adam optimizer, the average Dice Score on the validation set improves to 0.8 (shown in bold in the table).

[Table: optimization results for SegNet]

Segnet-basic is a shallower version of Segnet. Whereas Segnet adopts the first 13 layers of VGG16 for its encoder, the basic version has fewer encoder layers and the same number of decoder layers. A light version is also designed, with a smaller kernel size and padding (kernel_size=3, padding=1) in the encoders and decoders.

In addition to tuning the hyperparameters of Segnet, Segnet-basic obtains the following results.

[Table: results for Segnet-basic]

Making Segnet shallower and lighter achieves better performance; however, it suffers severely from overfitting.

In conclusion, the fitting and generalization capacity of Segnet (including Segnet-basic) is hard to improve significantly. It achieves at best a Dice Score of 0.852, with overfitting, which indicates that Segnet may have some vital defects and is not optimal for this segmentation task.

3.3. Comparison of Networks

3.3.1. Examples of assessment of predictions on validation set

[Figure: example predictions on the validation set]

3.3.2. Statistical Result

Changes in loss and Dice Score during the training of the networks under optimal settings are shown in (a) and (b). The confusion matrices (c) are acquired with the optimal model parameters.


3.3.3 Comparison of networks on test set

[Table: comparison of networks on the test set]

4. Conclusions

4.1. Crucial factors

Given a task (in this case, a medical semantic segmentation task), we need to be very careful when selecting a network to build our model. The papers proposing these networks summarize experiments evaluating their performance on representative public datasets and mention the suggested scenes in which to apply them.

For instance, SegNet achieves better performance on road scenes, which might be why it is not suitable for this medical segmentation task, while UNet and UNet 3+ are more frequently adopted in medical scenes. One latent factor reducing SegNet's performance and generalisation capacity in this task may be that, unlike UNet and UNet 3+, it does not pass the feature map of each encoder block to the decoder, thus losing some vital information. Also, complex architectures like DeepLab may not perform well and may overfit easily in this relatively simple segmentation scenario.

In short, before choosing a network for a specific task, the context of the task and the application scenes of different networks need to be considered carefully.

When we optimize training through data augmentation of the training set, an obvious improvement in labeling can be seen from the corresponding confusion matrices (as described in Section 3.1), which validates the idea that the size of the training set matters.

As for the validation set, a larger number of samples might enable a better comparison among models, and help achieve better generalisation capacity when deciding at which epoch to stop training.

4.2. The best generalization performance achieved

To evaluate the generalization performance of our models, we need to find a method to represent how well or how accurate the output of our model is on an unseen dataset, which is Dice Score on test set in our case.

The highest score achieved by our models is 0.89084. In addition, a narrower gap between the two curves (Section 3.2.2) can be observed for UNet and UNet 3+ during training.

4.3. Prospect

Rather than using a single model to segment an image, we produce an output segmentation map by combining the predicted masks from the two networks with different weights (as described in Section 2.2), so that we can combine and balance the advantages of both. Thus, more combinations are worth trying, such as adding DeepLab. Further, the weight for each network could also be learned.

The critical information in an image is usually around the centre (especially in our case). Therefore, we could reconstruct the model with two parallel pathways producing two segmentation maps, then combine them into a more precise output.

5. Appendix

Some of the code comes from open-source code online. We have partially modified it and applied it to this assignment. Only papers are listed in the References.

Other code we attached but did not list here can be found on GitHub.
Link : https://github.com/soda-bread/NC-Coursework-group-16

[1] Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, E., Dolz, J., & Ayed, I. B. (2019, May). Boundary loss for highly unbalanced segmentation. In International conference on medical imaging with deep learning (pp. 285-296). PMLR.

[2] Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 801-818).

[3] Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4), 834-848.

[4] Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980-2988).

[5] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).

[6] Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12), 2481-2495.

[7] Milletari, F., Navab, N., & Ahmadi, S. A. (2016, October). V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV) (pp. 565-571). IEEE.

[8] Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., ... & Wu, J. (2020, May). Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1055-1059). IEEE.

[9] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[10] Loshchilov, I., & Hutter, F. (2018). Fixing weight decay regularization in Adam.

[11] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) (pp. 234-241). Springer.

[CODE]

Group 16: Coursework for Cardiac MR Image Segmentation

0 Preparation

0.1 Load, show, and save images with OpenCV

0.2 Check the device

1 Define a segmentation model with Pytorch

1.1 Define a DataLoader

1.2 Define the Segmentation Model: UNet and UNet 3+

1.3 Define the Dice Score, Loss function and Optimizer

1.3.1 Dice Score

1.3.2 Loss function

1.3.3 Optimizer

2 Training, Validating and Testing

2.1 Unet : Training and Validating

2.2 Unet 3+ : Training and Validating

2.3 Load the model

2.4 Performance evaluation

2.4.1 Define the confusion matrix

2.4.2 Plot the confusion matrix on the validating dataset

2.4.3 Plot the predicted images on the validating dataset

2.4.4 Testing

3 Submission

3.2 Submit the data